Optimising sortperm! #90

Draft
shreyas-omkar wants to merge 3 commits into JuliaGPU:main from shreyas-omkar:sh/sort-optim

Conversation

@shreyas-omkar

Member

sortperm! Throughput (Float32)

| Array size ($n$) | Before | After | Speedup |
|---|---|---|---|
| $2^{14}$ | 0.215 ms | 0.197 ms | 1.1× |
| $2^{18}$ | 0.541 ms | 0.267 ms | 2.0× |
| $2^{20}$ | 2.185 ms | 0.490 ms | 4.5× |
| $2^{22}$ | 10.668 ms | 2.836 ms | 3.8× |
| $2^{24}$ | 53.453 ms | 11.904 ms | 4.5× |

Note: At $n = 16\text{M}$ ($2^{24}$), sortperm! performance is now within 1.3× of a raw, in-place sort! operation.


sort! with by= Transform (Float32, $n = 2^{22}$)

| Transformation case | Before | After | Speedup |
|---|---|---|---|
| identity (no-op, baseline) | 1.674 ms | | |
| by=abs | 2.197 ms | 1.891 ms | 1.16× |
| by=x->x^2 | ~2.350 ms | 1.873 ms | 1.25× |

shreyas-omkar and others added 2 commits June 24, 2026 13:33
…er at large n)

merge_sortperm_lowmem! carries a comparator that dereferences v[ix] and v[iy]
from global memory on every binary-search step inside the merge pass, making
the effective traffic O(n log²n). merge_sortperm! instead copies the keys into
shared memory alongside the indices so all comparisons stay in L1/shared memory.
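
The contrast between the two strategies can be sketched on the CPU with plain Base Julia. This is only an illustration of the idea, not the GPU kernels in this PR: `sortperm_indirect` and `sortperm_keyed` are hypothetical names, and shared memory is stood in for by a local array of `(key, index)` tuples.

```julia
# CPU-only illustration; both helper names are hypothetical, not part of the PR.

# Indirect variant: sort the indices and look the key up in `v` on every
# comparison (analogous to dereferencing v[ix] and v[iy] in the comparator).
function sortperm_indirect(v::AbstractVector)
    ix = collect(1:length(v))
    sort!(ix; lt = (i, j) -> isless(v[i], v[j]), alg = MergeSort)
    return ix
end

# Key-carrying variant: copy each key next to its index once, then compare
# only those local copies while sorting.
function sortperm_keyed(v::AbstractVector)
    pairs = [(v[i], i) for i in eachindex(v)]
    sort!(pairs; by = first, alg = MergeSort)
    return last.(pairs)
end

v = Float32[0.5, -1.0, 3.0, 0.5, 2.0]
@assert sortperm_indirect(v) == sortperm_keyed(v) == sortperm(v)
```

Both variants are stable and return the same permutation; the difference is only where the keys are read from during comparisons, which is the cost the commit targets.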

Benchmarks on RTX 5080 (CUDA 13.2, Julia 1.12):
  n=2^18:  0.541 ms → 0.286 ms  (1.9×)
  n=2^20:  2.185 ms → 0.490 ms  (4.5×)
  n=2^22: 10.668 ms → 2.847 ms  (3.7×)
  n=2^24: 53.453 ms → 11.900 ms (4.5×)

sortperm! is now within 1.3× of plain sort! across all tested sizes.

The public temp kwarg is preserved: it maps to temp_ix in merge_sortperm!
(same semantics — a pre-allocated index swap buffer).

Tests: extend sortperm testset with full permutation-validity checks, 6 new
element types (Int16/UInt16/Int64/UInt64/Float64/UInt8), edge sizes (n=1..2049),
data-distribution coverage, comparator options, temp-reuse, exact Base.sortperm
match, and a merge sort stability check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Without hoisting, the by(elem) transform fires inside every binary-search
comparison step across all O(n log²n) merge operations. With hoisting, we
broadcast by.(v) once to build a key array, then delegate to merge_sort_by_key!
which keeps keys in shared memory alongside values.
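
As a rough CPU analogue of the hoisting described above, the transform can be applied once up front and the sort driven by the resulting key array. `sort_by_hoisted!` is a hypothetical helper written for this sketch; it uses Base `sortperm` in place of `merge_sort_by_key!` and is not the code in this commit.

```julia
# CPU-only illustration; `sort_by_hoisted!` is a hypothetical helper.
function sort_by_hoisted!(v::AbstractVector; by = identity, rev::Bool = false)
    # Nothing to hoist for the identity transform: sort directly.
    by === identity && return sort!(v; rev = rev)
    k = by.(v)                     # apply the transform once per element
    p = sortperm(k; rev = rev)     # stable sort driven by the precomputed keys
    v .= v[p]                      # reorder the values to match
    return v
end

# Count how often the transform runs when hoisted.
calls = Ref(0)
counted_abs = x -> (calls[] += 1; abs(x))

v = Float32[-3.0, 1.0, -2.0, 0.5]
sort_by_hoisted!(v; by = counted_abs)
@assert v == Float32[0.5, 1.0, -2.0, -3.0]
@assert calls[] == length(v)       # one call per element, none per comparison
```

The trade-off is one extra key array of length `n` in exchange for evaluating `by` exactly `n` times.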

Benchmarks on RTX 5080 (Float32, n=2^22):
  by=abs:         2.197 ms → 1.912 ms  (-13%)
  by=x->x^2:     ~2.350 ms → 1.920 ms
  rev=true:       unchanged  (no by, not hoisted)
  identity:       unchanged  (guarded by by !== identity check)

The temp kwarg maps to temp_values in merge_sort_by_key! preserving the
existing API contract. All paths (sort!, merge_sort!, merge_sort_by_key!)
now benefit automatically for any non-identity by= function.

Tests: add sort_by_transform testset with exact Base.sort output matching
for Float32/Float64/Int32, edge sizes (n=1,2,513,1025), temp kwarg forwarding,
type-changing by= (Float32→Bool), and identity/rev=true non-regression checks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@christiangnrd

Member

KA has supports_float64(::Backend)

@shreyas-omkar

Member Author

> KA has supports_float64(::Backend)

Yeah, I just checked. Thanks for the heads-up. I will update the commit.
